Introduction

In 1998, Lucent Technologies and 
Motorola announced the formation of a 
joint development center dubbed 
“StarCore.” StarCore’s primary stated 
goal is to develop next-generation DSP 
processor cores which will be used by 
both Lucent and Motorola in their own 
chip-level products. In 1999, StarCore
announced its first new architecture: the
SC100. The first implementation of the
SC100 architecture is the SC140 core,
and this core is the focus of Inside the
StarCore SC140, a technical report
published by BDTI in May 2000.

StarCore states that the SC100
architecture is scalable, and that it
expects to create other cores based on
the SC100 architecture with different
complements of execution units than are
included in the SC140 core. These cores
may be assembly-code compatible with
the SC140, providing an upgrade path
for customers.

At the time this report was published, 
only Motorola had announced a product 
based on the SC140. This product,
Motorola’s MSC8101, was announced in
late 1999. Inside the StarCore SC140
evaluates the SC140 core and the
MSC8101 chip.

In Inside the StarCore SC140, the 
technical staff of BDTI evaluates the 
DSP performance of the SC140 (and 
MSC8101) and explores how the SC140 
architecture addresses the needs of DSP 
applications. The report includes both a 
detailed qualitative analysis of the
SC140’s architecture and a quantitative
evaluation based on the performance of 
the SC140 and MSC8101 on a series of 
DSP benchmarks developed by BDTI. 

At the time this report was written, in 
mid-2000, initial MSC8101 devices 
based on the SC140 were expected to run 
at speeds of 300 MHz using a 1.5-volt 
supply. StarCore has fabricated an
SC140-based evaluation chip that
executes at 300 MHz, on which BDTI
verified its benchmark timing.

The SC140 is notable for its extremely
high level of parallelism, even in
comparison to other VLIW-based
processors. The core provides this
parallelism while targeting very low
power consumption; it is the first
VLIW-based DSP processor to attempt to
combine low power consumption with very
high performance. At its 300 MHz clock
rate, the SC140 is currently the fastest
general-purpose DSP processor to be
demonstrated in silicon. The SC140 core
is targeted at high-performance
applications, such as cellular base
stations and gateways, and at portable
applications such as cellular terminal
devices.


Scope 

Inside the StarCore SC140 is intended
for anyone interested in understanding
the DSP performance and capabilities of
the SC140 or SC140-based products. It
assumes a basic knowledge of DSP
processor concepts and terms, both of
which are covered in BDTI’s text, DSP
Processor Fundamentals. Inside the
StarCore SC140 is especially useful for
electronic system designers, hardware
and software engineers, processor
designers, engineering managers, and
product marketing managers. It will aid
in the assessment of the SC140’s
suitability for a given application, and
will allow engineers and systems
designers to make informed decisions
when considering the SC140 for their
latest designs.

For comparison purposes, this report
includes brief analyses of several other
processors: Lucent Technologies’
DSP16xxx and Texas Instruments’
TMS320C54xx and TMS320C62xx.
These processors have been included to
give the reader insight into how the
SC140 compares to other well-known
DSP architectures.

About BDTI

Berkeley Design Technology, Inc.
(BDTI) was founded in 1991 to assist
companies in creating, selecting, and
using DSP technology. The technical
staff of BDTI has extensive experience
in the development of DSP-intensive
software and hardware for commercial
applications. BDTI offers a variety of
technical products and services,
including:

• Published reports on DSP
processors and technology

• DSP software development services

• Technical advisory services

• Training


The SC140 Processor Core 

The SC140 is a VLIW architecture,
and can execute up to six instructions at
a time. Instructions that are grouped for
parallel execution are referred to as an
“execution set” by StarCore. Instructions
are scheduled for parallel execution at 
compile time by code-generation tools or 
by the assembly-language programmer. 

The SC140 contains four 16-bit data 
paths, each of which contains a
combined ALU/MAC/bit-field unit (BFU).
The BFU contains a 40-bit barrel shifter. 
All of the data paths are identical, and 
share a common set of 16 source and 
destination registers. 

The MAC units, ALUs, and BFUs that 
comprise each of the four data paths are 
not independent (in contrast to other 
VLIW processors, which typically have 
independent MAC, ALU, and shifter 
units). Hence, it is not possible, for 
example, to issue a set of instructions 
that uses all four MAC units and also one 
of the BFUs. For this reason, in each
group of six instructions executed in
parallel, only four can use the data
paths. The remaining two instructions in
an execution set can use the address
generation unit to perform data moves,
pointer
arithmetic, or bit mask operations; or 
they can specify program flow-control 
instructions. 

The SC140’s four data paths can each
perform single-cycle 16x16-bit
multiplications. The multipliers support
all combinations of signed and unsigned
operands, and support fractional and
integer formats (both operands must be
integer or both must be fractional).
Each data path supports SIMD-style
addition and subtraction (using the ADD2
and SUB2 instructions) by treating
values in registers as packed pairs of
16-bit operands. For example, using SIMD
operations, the SC140 can perform eight
16-bit additions per instruction cycle
(two in each of the four data paths).
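
The packed-pair idea can be illustrated with a short Python
sketch. This is only a model of lane-wise 16-bit addition, not
the SC140's actual ADD2 definition: the 32-bit container, the
wraparound on overflow, and the helper names (pack, unpack,
add2) are assumptions made for illustration.

```python
MASK16 = 0xFFFF  # one 16-bit lane

DATA_PATHS = 4       # from the text: four identical data paths
LANES_PER_PATH = 2   # from the text: packed pairs of 16-bit operands


def pack(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word (assumed layout)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack(word: int) -> tuple:
    """Split a packed word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def add2(a: int, b: int) -> int:
    """Add two packed words lane by lane.

    Each lane wraps modulo 2**16 and no carry crosses from the low
    lane into the high lane. Wraparound is an assumption; the source
    does not describe overflow or saturation behavior.
    """
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


def additions_per_cycle() -> int:
    """Peak 16-bit additions per cycle if every data path issues one packed add."""
    return DATA_PATHS * LANES_PER_PATH
```

With these definitions, unpack(add2(pack(1, 2), pack(3, 4)))
yields (4, 6), and additions_per_cycle() gives the figure of
eight quoted above.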

High Data Memory Bandwidth 

The SC140 has two 32-bit address
buses and two 64-bit data buses for
transferring data. Instructions are
fetched via a 32-bit address bus and
128-bit data bus. Program and data
memory is unified; any address can
contain either instructions or data.

The SC140 can perform two data
reads, two data writes, or one read and
one write per instruction cycle. Each
read or write can access contiguous
groups of data up to 64 bits wide. On a
300 MHz SC140, the maximum on-core
data memory bandwidth is therefore
2400 million 16-bit words/second. The
SC140 has much higher on-chip data
memory bandwidth than most other DSP
processors. Its memory bandwidth
should be sufficient to keep the
execution units supplied with data and
avoid data memory bottlenecks when the
processor uses data in on-chip memory.
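
The 2400 million words/second figure follows directly from the
numbers in the text: two accesses per cycle, each up to 64 bits
wide, at 300 MHz. A minimal Python sketch of that arithmetic
(the function and parameter names are illustrative only):

```python
def peak_words_per_second(clock_hz: int,
                          accesses_per_cycle: int = 2,
                          access_bits: int = 64,
                          word_bits: int = 16) -> int:
    """Peak data bandwidth in words/second for the stated bus widths."""
    # Two 64-bit accesses carry 2 * (64 / 16) = 8 sixteen-bit words per cycle.
    words_per_cycle = accesses_per_cycle * (access_bits // word_bits)
    return clock_hz * words_per_cycle


# 300 MHz * 8 words/cycle = 2400 million words/second
peak_million_words = peak_words_per_second(300_000_000) // 1_000_000
```

This is a peak figure; it assumes every access is a full 64-bit
transfer of contiguous data in on-chip memory.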

Instruction Set 

The SC140 can fetch eight 16-bit
instruction words per cycle and can
execute up to six instructions in
parallel (the remaining two words can be
used for prefixes, described below, or
for immediate values). Each instruction
in the execution set uses one execution
unit.

Two different methods are used for 
specifying which instructions will be 
included in an execution set: serial 
grouping and prefix grouping. 

Serial grouping uses the two most
significant bits in the instruction to
determine the end of an execution set.

Prefix grouping adds a one-word or 
two-word prefix to an execution set. The 
prefix defines how many instructions are 
included in the execution set, and also 
contains information used for
conditional execution and looping.
Prefix grouping must be used if
instructions are to be executed
conditionally.

Instructions within the same execution 
set always start execution at the same 
time. A new execution set begins execu¬ 
tion only after all instructions belonging 
to previous execution sets are completed. 
Therefore, the time required to complete 
an execution set is determined by the 
instruction in the set that requires the 
most time. 
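
The timing rule in the preceding paragraph can be sketched as a
small Python model. It assumes each instruction has a known
latency in cycles and ignores every other effect (fetch, memory
stalls, pipeline fill); the latency values and function names
are hypothetical, not taken from the report.

```python
MAX_SET_SIZE = 6  # from the text: up to six instructions per execution set


def execution_set_cycles(latencies: list) -> int:
    """Cycles for one execution set: its slowest instruction decides."""
    if not 1 <= len(latencies) <= MAX_SET_SIZE:
        raise ValueError("an execution set holds one to six instructions")
    return max(latencies)


def program_cycles(execution_sets: list) -> int:
    """Total cycles when each set starts only after the previous one completes."""
    return sum(execution_set_cycles(s) for s in execution_sets)


# Three sets; the second contains one two-cycle instruction.
total = program_cycles([[1, 1, 1, 1], [1, 2], [1]])
```

In this model the second set costs two cycles even though only
one of its instructions is slow, which is why grouping long and
short instructions together can waste issue slots.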

The SC140 instruction set is quite
orthogonal, because most of the
instructions are simple and specify a
single operation. Unlike some VLIW
processors, the SC140’s instruction set
is composed of relatively short (16-bit)
instructions whose functionality can be
extended using prefixes or extensions.
Short instructions often require
processor architects to place
restrictions on, e.g., register usage;
the SC140 avoids the need for register
restrictions by using prefix words where
needed. With the exception of special
instructions for Viterbi decoding, all
SC140 instructions can use all registers
without restriction. There are a number
of restrictions on grouping instructions
in execution sets, however, which
complicate assembly language
programming.


Pipeline 

The SC140 processor uses a five-stage
pipeline consisting of pre-fetch, fetch,
dispatch, address generation, and
execute stages. The SC140 pipeline is
not interlocked; however, the assembler
detects pipeline hazards and issues
warnings. In comparison to other
VLIW-based DSP processors, such as
Carmel and the TMS320C62xx, the SC140’s
pipeline is quite short. The SC140’s
five-stage pipeline is benign, and does
not seriously complicate programming.
Pipeline hazards can be avoided with
very little programming effort.

Addressing 

The SC140 provides one address
generation unit (AGU) that contains two
address arithmetic units (AAUs), a bit
mask unit (BMU), and a set of addressing
registers. The AGU is capable of
generating two addresses per instruction
cycle, and provides 16 primary registers
(R0-R15). As with many DSP processors,
the SC140’s maximum memory bandwidth
can only be achieved when data is
arranged in groups of contiguous words
in memory, since the processor is only
capable of generating two addresses at a
time.


Benchmark Performance 

Inside the StarCore SC140 includes
extensive benchmark results, used to
quantitatively evaluate the processor’s
DSP performance. For each benchmark,
BDTI reports cycle counts, execution
times, energy consumption,
cost-performance, and memory usage. BDTI
also provides extensive analysis of why
the processors perform as they do. In
this section, we present sample
execution time and memory usage results
excerpted from the complete set of
results in the report.

Execution Time 

The execution time for a BDTI
Benchmark function is defined as the
amount of time required by the processor
to complete the benchmark’s
initialization, kernel, and termination
sections. To determine the execution
time of a particular benchmark on a
given processor, the number of
instruction cycles the processor
requires to execute the benchmark is
multiplied by the processor’s
instruction cycle time. Inside the
StarCore SC140 includes tables and
charts illustrating the number of cycles
required by each processor to execute
each benchmark, and uses these results
to generate corresponding tables and
charts of execution times at a specified
clock speed. For the SC140, the clock
speed used for determining execution
times is the projected speed for
Motorola’s first SC140-based chip, the
MSC8101.
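
The conversion from cycle counts to execution time described
above is a single multiplication. A minimal Python sketch,
assuming one instruction cycle per clock cycle; the cycle count
used in the example is a made-up placeholder, not a BDTI
result.

```python
def execution_time_us(cycles: int, clock_hz: float) -> float:
    """Execution time in microseconds: cycles times the cycle time (1 / clock_hz)."""
    return cycles * 1e6 / clock_hz


# Hypothetical benchmark taking 3000 cycles on a 300 MHz part.
example_us = execution_time_us(3000, 300e6)  # 10.0 microseconds
```

The same cycle count on a slower clock scales linearly, so a
processor can win on execution time through a lower cycle
count, a higher clock rate, or both.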

About the BDTI Benchmarks™

The BDTI Benchmarks are a set of
DSP software functions that BDTI has
independently designed to provide an
objective basis for comparing processor
performance characteristics such as
speed and memory use for DSP
applications. The BDTI Benchmark
functions are implemented in assembly
language to allow a realistic assessment
of processors’ DSP performance. The
resulting software is then verified for
functional correctness, optimality, and
adherence to the BDTI Benchmark
specifications. Benchmark performance
results are obtained through manual
analysis and careful, detailed
simulation, or by measurement on sample
devices.

Sample Benchmark Results

[Figure: Viterbi Decoder Benchmark
Execution Time (ms); lower is better]

Except for the MSC8101, speeds shown
are for the fastest chips sampling as of
June 2000, according to the vendors.
BDTI has not verified that these speeds
have been achieved. The MSC8101 is
expected to be sampling at 300 MHz in
August 2000.

* Results shown are for one of the two
on-chip cores.

The execution time results for BDTI’s
Viterbi decoder benchmark are shown in
the figure above. As illustrated in this
figure, the MSC8101 (Motorola’s
SC140-based chip) has a significantly
faster result on this benchmark than any
of the other processors shown. The
SC140 core has instructions dedicated to
Viterbi decoding (described in more
detail in the full report), and is able
to efficiently implement the
addition-intensive portion of the
algorithm via its support for eight
16-bit additions per cycle. The
processor’s use of multiple execution
units and specialized instructions
results in an extremely low cycle count
for the Viterbi benchmark; the MSC8101
consumes fewer than one-third of the
cycles required by the TMS320C6203.
Its architectural efficiency combined
with its high clock rate enables the
MSC8101 to achieve a very strong result
on this benchmark.
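
To show why Viterbi decoding is addition-intensive, the sketch
below implements one generic add-compare-select (ACS) step in
Python. It is a textbook formulation chosen for illustration,
not BDTI's benchmark code and not a model of the SC140's
Viterbi instructions; the two-state trellis and all names are
assumptions.

```python
def acs_step(path_metrics, predecessors, branch_metrics):
    """One add-compare-select step over all trellis states.

    path_metrics[s]   accumulated metric of the best path ending in state s
    predecessors[s]   the two states that can transition into state s
    branch_metrics[s] the metrics of those two transitions
    Returns (new_metrics, decisions); decisions[s] is 0 or 1, naming
    which predecessor survived. Smaller metrics are treated as better.
    """
    new_metrics, decisions = [], []
    for s, (p0, p1) in enumerate(predecessors):
        cand0 = path_metrics[p0] + branch_metrics[s][0]  # add
        cand1 = path_metrics[p1] + branch_metrics[s][1]  # add
        pick = 0 if cand0 <= cand1 else 1                # compare
        new_metrics.append(cand1 if pick else cand0)     # select
        decisions.append(pick)
    return new_metrics, decisions


# Toy two-state trellis with hypothetical metrics.
metrics, survivors = acs_step(
    path_metrics=[0, 5],
    predecessors=[(0, 1), (0, 1)],
    branch_metrics=[(1, 0), (7, 1)],
)
```

Every state costs two additions per step, so a decoder with
many states spends most of its time on 16-bit adds, which is
the kind of work that packed SIMD addition accelerates.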

Memory Usage 

Speed is often the first metric
designers use to compare processors.
Memory use is also of interest, however,
for several reasons. For example, memory
use may have a significant impact on
overall system cost. Memory use can also
affect processors’ performance; if
application software and data cannot fit
entirely in on-chip memory, significant
performance degradation may occur on
many processors. Because of these and
other factors, memory use is an
important metric for processor
selection. For each benchmark, BDTI
reports each processor’s program,
constant data, and non-constant data
memory use.

Most of the BDTI Benchmarks are 
optimized first for maximum speed, then 
for minimum memory usage, because 
this is usually the order of priorities in 
DSP applications. The exception to this 
rule is the Control benchmark, described 
below. 

Control Benchmark 

The BDTI Benchmarks include one
benchmark function specifically
designed to evaluate memory use for
control-oriented programs.
Control-oriented code usually takes up
the bulk of a DSP application’s memory
requirements but only a small fraction
of the application’s processing time.
Thus, in control-oriented code, memory
use is usually a more serious concern
than execution speed.

BDTI’s Control benchmark is designed 
to be representative of control-oriented 
code. The primary goal for programmers 